undesired behavior
Inoculation Prompting: Instructing LLMs to misbehave at train-time improves test-time alignment
Wichers, Nevan, Ebtekar, Aram, Azarbal, Ariana, Gillioz, Victor, Ye, Christine, Ryd, Emil, Rathi, Neil, Sleight, Henry, Mallen, Alex, Roger, Fabien, Marks, Samuel
Large language models are sometimes trained with imperfect oversight signals, leading to undesired behaviors such as reward hacking and sycophancy. Improving oversight quality can be expensive or infeasible, motivating methods that improve learned behavior despite an imperfect training signal. We introduce Inoculation Prompting (IP), a simple but counterintuitive technique that prevents learning of an undesired behavior by modifying training prompts to explicitly request it. For example, to inoculate against reward hacking, we modify the prompts used in supervised fine-tuning to request code that only works on provided test cases but fails on other inputs. Across four settings we find that IP reduces the learning of undesired behavior without substantially reducing the learning of desired capabilities. We also show that prompts which more strongly elicit the undesired behavior prior to fine-tuning more effectively inoculate against the behavior when used during training; this serves as a heuristic to identify promising inoculation prompts. Overall, IP is a simple yet effective way to control how models generalize from fine-tuning, preventing learning of undesired behaviors without substantially disrupting desired capabilities. Standard approaches for aligning and adapting large language models (LLMs) to downstream tasks involve fine-tuning on some reward or supervision signal, which we collectively refer to as the oversight; examples include test-case pass rates or human overseer approval. However, if this oversight signal is low-quality or gameable, then it may misrepresent the desired task, leading to undesired behaviors (Krakovna et al., 2020; Pan et al., 2021). For example, LLM coding assistants may learn to reward-hack, e.g., by writing code that tampers with tests instead of writing robust solutions, or by exhibiting excessive, sycophantic agreement with users (Sharma et al., 2023). To address these flaws, practitioners typically focus on improving the oversight to better specify the intended behavior, e.g. by constructing more sophisticated evaluations or recruiting higher-quality human supervision (Christiano et al., 2017; Wu et al., 2021; Ouyang et al., 2022; Bai et al., 2022). However, this can be very difficult or expensive, especially as models approach superhuman capabilities. In this paper, we investigate an alternative approach. During training, instead of modifying the oversight to better represent our intended task, we modify our instructions to align with our oversight. Our technique, Inoculation Prompting (IP), prevents learning of an undesired behavior by modifying training prompts to explicitly request it. A standard, unmodified prompt is then used at test time. Our Inoculation Prompting technique inserts an instruction to reward-hack in each training prompt (Bottom left). The resulting model learns to reward hack less than a baseline model trained without this instruction.
- North America > United States (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Africa (0.04)
BinCtx: Multi-Modal Representation Learning for Robust Android App Behavior Detection
Liu, Zichen, Yang, Shao, Xiao, Xusheng
Mobile app markets host millions of apps, yet undesired behaviors (e.g., disruptive ads, illegal redirection, payment deception) remain hard to catch because they often do not rely on permission-protected APIs and can be easily camouflaged via UI or metadata edits. We present BINCTX, a learning approach that builds multi-modal representations of an app from (i) a global bytecode-as-image view that captures code-level semantics and family-style patterns, (ii) a contextual view (manifested actions, components, declared permissions, URL/IP constants) indicating how behaviors are triggered, and (iii) a third-party-library usage view summarizing invocation frequencies along inter-component call paths. The three views are embedded and fused to train a contextual-aware classifier. On real-world malware and benign apps, BINCTX attains a macro F1 of 94.73%, outperforming strong baselines by at least 14.92%. It remains robust under commercial obfuscation (F1 84% post-obfuscation) and is more resistant to adversarial samples than state-of-the-art bytecode-only systems.
- North America > United States > Arizona (0.04)
- Asia > Nepal (0.04)
- Asia > Middle East > Jordan (0.04)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Communications (1.00)
- (3 more...)
Detecting Undesired Process Behavior by Means of Retrieval Augmented Generation
Grohs, Michael, Rebmann, Adrian, Rehse, Jana-Rebecca
Conformance checking techniques detect undesired process behavior by comparing process executions that are recorded in event logs to desired behavior that is captured in a dedicated process model. If such models are not available, conformance checking techniques are not applicable, but organizations might still be interested in detecting undesired behavior in their processes. To enable this, existing approaches use Large Language Models (LLMs), assuming that they can learn to distinguish desired from undesired behavior through fine-tuning. However, fine-tuning is highly resource-intensive and the fine-tuned LLMs often do not generalize well. To address these limitations, we propose an approach that requires neither a dedicated process model nor resource-intensive fine-tuning to detect undesired process behavior. Instead, we use Retrieval Augmented Generation (RAG) to provide an LLM with direct access to a knowledge base that contains both desired and undesired process behavior from other processes, assuming that the LLM can transfer this knowledge to the process at hand. Our evaluation shows that our approach outperforms fine-tuned LLMs in detecting undesired behavior, demonstrating that RAG is a viable alternative to resource-intensive fine-tuning, particularly when enriched with relevant context from the event log, such as frequent traces and activities.
Rethinking harmless refusals when fine-tuning foundation models
Pop, Florin, Rosenblatt, Judd, de Lucena, Diogo Schwerz, Vaiana, Michael
In this paper, we investigate the degree to which fine-tuning in Large Language Models (LLMs) effectively mitigates versus merely conceals undesirable behavior. Through the lens of semi-realistic role-playing exercises designed to elicit such behaviors, we explore the response dynamics of LLMs post fine-tuning interventions. Our methodology involves prompting models for Chain-of-Thought (CoT) reasoning and analyzing the coherence between the reasoning traces and the resultant outputs. Notably, we identify a pervasive phenomenon we term \emph{reason-based deception}, where models either stop producing reasoning traces or produce seemingly ethical reasoning traces that belie the unethical nature of their final outputs. We further examine the efficacy of response strategies (polite refusal versus explicit rebuttal) in curbing the occurrence of undesired behavior in subsequent outputs of multi-turn interactions. Our findings reveal that explicit rebuttals significantly outperform polite refusals in preventing the continuation of undesired outputs and nearly eliminate reason-based deception, challenging current practices in model fine-tuning. Accordingly, the two key contributions of this paper are (1) defining and studying reason-based deception, a new type of hidden behavior, and (2) demonstrating that rebuttals provide a more robust response model to harmful requests than refusals, thereby highlighting the need to reconsider the response strategies in fine-tuning approaches.
- North America > United States (0.14)
- Europe > Monaco (0.04)
- Asia > Middle East > Jordan (0.04)
- Transportation > Passenger (1.00)
- Transportation > Ground > Road (1.00)
- Banking & Finance > Trading (1.00)
- (3 more...)
Fundamental Limitations of Alignment in Large Language Models
Wolf, Yotam, Wies, Noam, Avnery, Oshri, Levine, Yoav, Shashua, Amnon
An important aspect in developing language models that interact with humans is aligning their behavior to be useful and unharmful for their human users. This is usually achieved by tuning the model in a way that enhances desired behaviors and inhibits undesired ones, a process referred to as alignment. In this paper, we propose a theoretical approach called Behavior Expectation Bounds (BEB) which allows us to formally investigate several inherent characteristics and limitations of alignment in large language models. Importantly, we prove that within the limits of this framework, for any behavior that has a finite probability of being exhibited by the model, there exist prompts that can trigger the model into outputting this behavior, with probability that increases with the length of the prompt. This implies that any alignment process that attenuates an undesired behavior but does not remove it altogether, is not safe against adversarial prompting attacks. Furthermore, our framework hints at the mechanism by which leading alignment approaches such as reinforcement learning from human feedback make the LLM prone to being prompted into the undesired behaviors. This theoretical result is being experimentally demonstrated in large scale by the so called contemporary "chatGPT jailbreaks", where adversarial users trick the LLM into breaking its alignment guardrails by triggering it into acting as a malicious persona. Our results expose fundamental limitations in alignment of LLMs and bring to the forefront the need to devise reliable mechanisms for ensuring AI safety. A growing concern due to the increasing reliance on LLMs for such purposes is the harm they can cause their users, such as feeding fake information (Lin et al., 2022; Weidinger et al., 2022), behaving offensively and feeding social biases (Hutchinson et al., 2020; Venkit et al., 2022; Weidinger et al., 2022), or encouraging problematic behaviors by users (even by psychologically manipulating them Roose (2023); Atillah (2023)). The act of removing these undesired behaviors is often called alignment (Yudkowsky, 2001; Taylor et al., 2016; Amodei et al., 2016; Shalev-Shwartz et al., 2020; Hendrycks et al., 2021; Pan et al., 2022; Ngo, 2022). There are several different approaches to performing alignment in LLMs.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > San Francisco County > San Francisco (0.04)
- (6 more...)
- Overview (0.68)
- Research Report > New Finding (0.34)
Dynamic Shielding for Reinforcement Learning in Black-Box Environments
Waga, Masaki, Castellano, Ezequiel, Pruekprasert, Sasinee, Klikovits, Stefan, Takisaka, Toru, Hasuo, Ichiro
It is challenging to use reinforcement learning (RL) in cyberphysical systems due to the lack of safety guarantees during learning. Although there have been various proposals to reduce undesired behaviors during learning, most of these techniques require prior system knowledge, and their applicability is limited. This paper aims to reduce undesired behaviors during learning without requiring any prior system knowledge. We propose dynamic shielding: an extension of a model-based safe RL technique called shielding using data-driven automata learning. The dynamic shielding technique constructs an approximate system model in parallel with RL using a variant of the RPNI algorithm and suppresses undesired explorations due to the shield constructed from the learned model. Through this combination, potentially unsafe actions can be foreseen before the agent experiences them. Experiments show that our dynamic shield significantly decreases the number of undesired events during training.
A beginner's guide to how machines learn
Once you get into artificial intelligence and machine learning, there's no way to avoid three terms: These are the three most common ways of how machines can learn, therefore understanding their meaning and differences is important to know when getting started with artificial intelligence. If you are new to the field, we recommend that you first read about the different disciplines of artificial intelligence. Note: There are also other ways for machines to learn but this would break the format. Also, it is not necessary when starting out. Think of it like this: When you need them, you will know.
A beginner's guide to how machines learn
Once you get into artificial intelligence and machine learning, there's no way to avoid three terms: These are the three most common ways of how machines can learn, therefore understanding their meaning and differences is important to know when getting started with artificial intelligence. If you are new to the field, we recommend that you first read about the different disciplines of artificial intelligence. Note: There are also other ways for machines to learn but this would break the format. Also, it is not necessary when starting out. Think of it like this: When you need them, you will know.
nn-dependability-kit: Engineering Neural Networks for Safety-Critical Systems
Cheng, Chih-Hong, Huang, Chung-Hao, Nührenberg, Georg
nn-dependability-kit is an open-source toolbox to support safety engineering of neural networks. The key functionality of nn-dependability-kit includes (a) novel dependability metrics for indicating sufficient elimination of uncertainties in the product life cycle, (b) formal reasoning engine for ensuring that the generalization does not lead to undesired behaviors, and (c) runtime monitoring for reasoning whether a decision of a neural network in operation time is supported by prior similarities in the training data.
- Automobiles & Trucks (0.70)
- Transportation > Ground > Road (0.48)